http://condor.depaul.edu/ntomuro/courses/575/assign/HW2.html
The TED dataset "ted_main.csv" contains information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017, including the number of views, number of comments, short descriptions, speaker names, and titles.
Write code to obtain the following information about the words and tokens that appeared in the descriptions of all talks (the 'description' column in the dataset) after processing the text in the specified ways. Essentially, your task is to fill in the following table:
import pandas as pd
ted_raw = pd.read_csv('ted_main.csv', encoding = 'utf8')
print(ted_raw.columns)
ted_raw.head()
import nltk
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
ted_raw['description'].head()
nobs = len(ted_raw['description'])
docs_token = [None] * nobs
docs_token_raw = [None] * nobs
docs_token_stem = [None] * nobs
porter = nltk.PorterStemmer()
stop_words = set(stopwords.words('english'))  # set membership tests are much faster than a list
for i in range(nobs):
    docs_token_raw[i] = word_tokenize(ted_raw['description'][i])
    # case-fold and filter English stopwords
    docs_token[i] = [w.lower() for w in docs_token_raw[i] if w.lower() not in stop_words]
    # filter tokens that contain non-alphabetic character(s)
    docs_token[i] = [w for w in docs_token[i] if w.isalpha()]
    docs_token_stem[i] = [porter.stem(tok) for tok in docs_token[i]]  # apply Porter stemmer
print(docs_token_raw[0])
print(docs_token[0])
print(docs_token_stem[0])
| # | [A] Word tokenization (only) | [B] Word tokenization + Case folding (lower-case) + Stopword filtering + Non-alphabet filtering | [C] Word tokenization + Case folding (lower-case) + Stopword filtering + Non-alphabet filtering + Porter stemming |
|---|---|---|---|
| (1) Total # of tokens | | | |
| (2) Size of vocabulary | | | |
| (3) Top 20 most common token types with frequency (listed in descending order of frequency) | | | |
| (4) Percentage of tokens in the dataset covered by the top 20 token types | | | |
NOTES:
def token_counter(doc_token_lists):
    all_tokens = [token for doc_t in doc_token_lists for token in doc_t]
    print('[1] ' + str(len(all_tokens)))  # total token count -> sum(fdist.values())
    # Count frequencies of the vocabulary terms
    fdist = nltk.FreqDist(all_tokens)
    # fdist is essentially a Python dictionary
    print('[2] ' + str(len(fdist)))  # size of the vocabulary (number of unique token types)
    print('[3] ' + str(fdist.most_common(20)))
    print('[4] ' + str(round(sum(freq for _, freq in fdist.most_common(20)) / len(all_tokens) * 100, 2)) + '%')
    print()
token_counter(docs_token_raw)
token_counter(docs_token)
token_counter(docs_token_stem)
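As a sanity check on the numbers `token_counter` reports, here is a minimal pure-Python equivalent using `collections.Counter` in place of `nltk.FreqDist`; the toy token list (and the counts in the comments) are made up for illustration only:

```python
from collections import Counter

toy_tokens = ["talk", "talk", "world", "talk", "new", "world", "us", "talk"]
fdist = Counter(toy_tokens)

total = sum(fdist.values())                       # (1) total tokens: 8
vocab = len(fdist)                                # (2) vocabulary size: 4
top2 = fdist.most_common(2)                       # (3) [('talk', 4), ('world', 2)]
coverage = sum(f for _, f in top2) / total * 100  # (4) top-2 coverage: 75.0%
print(total, vocab, top2, round(coverage, 2))
```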
The size decreased significantly from [A] to [B] because tokens are eliminated by the stopword and non-alphabet filters, and distinct casings are merged by case folding.
The size decreased significantly again from [B] to [C] because vocabulary terms are stemmed and merged into their roots; for instance, [entertainer, entertaining, entertainment] are merged into a single vocabulary entry.
Because stemming reduces the vocabulary size while the total number of tokens stays the same, the most frequent types naturally cover a larger share of the tokens. For instance, 'talk' in [B] counts only exact occurrences of 'talk', whereas 'talk' in [C] may also include [talked, talks], which are all stemmed into the single vocabulary entry 'talk'.
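The merging described above can be checked directly with NLTK's Porter stemmer; a minimal sketch, using the same example words as the discussion:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# surface variants of "talk" collapse to one stem
print([porter.stem(w) for w in ["talk", "talks", "talked"]])
# the "entertain" family merges the same way
print([porter.stem(w) for w in ["entertainer", "entertaining", "entertainment"]])
```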
| # | [A] | [B] | [C] |
|---|---|---|---|
| (1) # of tokens | 151994 | 78428 | 78428 |
| (2) Size of vocab | 17877 | 14816 | 10676 |
| (3) Top 20 common | [(',', 7382), ('.', 5764), ('the', 5395), ('and', 4264), ('of', 3651), ('to', 3528), ('a', 3505), ('in', 1762), ('--', 1485), ('that', 1472), ("'s", 1217), ('for', 1140), ('``', 898), ("''", 893), ('with', 879), ('we', 878), ('is', 834), ('it', 833), ('?', 824), ('this', 812)] | [('in', 762), ('talk', 700), ('us', 643), ('world', 515), ('new', 415), ('says', 411), ('people', 332), ('shares', 326), ('the', 306), ('shows', 282), ('life', 274), ('one', 272), ('ted', 254), ('like', 251), ('make', 239), ('way', 227), ('he', 224), ('human', 205), ('but', 205), ('work', 203)] | [('talk', 880), ('in', 762), ('us', 643), ('world', 527), ('say', 453), ('make', 449), ('share', 444), ('new', 415), ('show', 371), ('use', 360), ('work', 356), ('peopl', 334), ('human', 330), ('way', 326), ('one', 307), ('stori', 307), ('the', 306), ('live', 282), ('help', 281), ('life', 274)] |
| (4) % top 20 tokens in ds | 31.20% | 8.98% | 10.72% |
Using the tags associated with talks in the TED dataset, create a word cloud for tags 'climate change' and 'AI'.
There are many tools and reference sites available that help you create word clouds (such as this, this and a search result). Any will do. You pick one and figure out how to use it.
Make one cloud for each tag. Copy/paste the generated clouds in your submission file.
NOTE:

from wordcloud import WordCloud
ted_raw['tags'].head()
# convert string representation of list to list in Python
# https://www.tutorialspoint.com/How-to-convert-string-representation-of-list-to-list-in-Python
import ast
str(ted_raw['tags'].head()[1])
ast.literal_eval(str(ted_raw['tags'].head()[1]))
# for items in tags:
# print(ast.literal_eval(items))
# print(([tag for items in tags for tag in ast.literal_eval(items)]))
tags = ted_raw['tags']
# fdist = nltk.FreqDist([tag for items in tags for tag in ast.literal_eval(items)])
doc_tags = [(ast.literal_eval(items)) for items in tags]
doc_tags
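For reference, `ast.literal_eval` parses such string-encoded lists safely (unlike `eval`, it only accepts Python literals); the sample string below is made up, in the dataset's format:

```python
import ast

# the 'tags' column stores lists as strings, e.g. "['AI', 'climate change']"
raw = "['AI', 'climate change', 'technology']"
tags = ast.literal_eval(raw)   # parse the string literal into a real Python list
print(type(tags).__name__, tags)
```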
docs_ai, docs_cli = [], []
for i, tags in enumerate(doc_tags):
    if 'AI' in tags:
        docs_ai.append(i)
    if 'climate change' in tags:
        docs_cli.append(i)
# fdist = nltk.FreqDist([tag for tags in tags_cli for tag in tags if tag not in 'climate change'])
from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud().generate(' '.join([token for i in docs_cli for token in docs_token[i]]))
plt.figure(figsize=(16, 8), dpi=300)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
wordcloud = WordCloud().generate(' '.join([token for i in docs_ai for token in docs_token[i]]))
plt.figure(figsize=(16, 8), dpi=300)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
NOTES: Since inverted index uses the ID of the terms and documents, you probably have to do the task in a few steps, where you store intermediate results in some data structures or write in temporary files. Any way is fine, as long as you accomplish the task. IMPORTANT: When you write "term_index.csv" and "inverted_index.csv", you have to open the file (for writing) with an optional parameter encoding='utf-8'.
a, 1, 2
aakash, 2, 1
aala, 3, 1
aamodt, 4, 1
aaron, 5, 4
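As noted above, the output files must be opened with `encoding='utf-8'`. Python's built-in `csv` module honors the same parameter and also handles quoting should a field ever contain a comma; a small sketch with made-up rows and a temporary path:

```python
import csv
import os
import tempfile

rows = [("a", 1, 2), ("aakash", 2, 1)]  # (term, termId, frequency), sample values only
path = os.path.join(tempfile.gettempdir(), "term_index_demo.csv")

with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)  # newline="" is recommended for the csv module

with open(path, encoding="utf-8") as f:
    print(f.read())
```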
f = open("TED_term_index.csv", "w", encoding='utf-8')
fdist = nltk.FreqDist([token for doc_t in docs_token_stem for token in doc_t])
i = 1
for term, freq in sorted(fdist.items()):
    if i < 10:
        print(term + ", " + str(i) + ", " + str(freq))
    f.write(term + ", " + str(i) + ", " + str(freq) + '\n')
    i += 1
f.close()
1, https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity
2, https://www.ted.com/talks/al_gore_on_averting_climate_crisis
3, https://www.ted.com/talks/david_pogue_says_simplicity_sells
f = open("TED_doc_index.csv", "w", encoding='utf-8')
i = 1
for url in ted_raw['url']:
    if i < 10:
        print(str(i) + ", " + url, end="")  # the url strings already end with a newline
    f.write(str(i) + ", " + url)
    i += 1
f.close()
1, 1146, 1, 2429, 1
2, 1878, 1
3, 2381, 1
4, 1655, 1
5, 810, 1, 943, 1, 951, 1, 1717, 1
from collections import Counter
word_count_dict = {}
# Build the index by appending each docId into the dict; duplicate docIds are kept for now
for i in range(len(docs_token_stem)):
    for word in docs_token_stem[i]:
        word_count_dict.setdefault(word, []).append(i + 1)
# Collapse duplicate docIds into within-document frequencies and write the postings
f = open("TED_inverted_index.csv", "w", encoding='utf-8')
term_id = 1
for key in sorted(word_count_dict.keys()):
    c = Counter(word_count_dict[key])  # docId -> frequency of the term in that doc
    fout = str(term_id) + ", " + ', '.join(str(v) for pair in c.items() for v in pair)
    if term_id <= 10:
        print(key)
        print(fout)
    f.write(fout + '\n')
    term_id += 1
f.close()
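A natural way to check the resulting postings is to run a boolean AND query: intersect the docId sets of each query term. Below is a minimal sketch on a made-up toy index with the same term -> {docId: frequency} shape (the stemmed terms and docIds are illustrative, not taken from the TED files):

```python
# toy postings: term -> {docId: within-document frequency}
inverted = {
    "climat": {2: 3, 7: 1, 9: 2},
    "chang":  {2: 1, 5: 2, 9: 1},
    "talk":   {1: 1, 2: 1, 5: 1},
}

def and_query(index, terms):
    """Return the docIds containing every query term, via postings intersection."""
    postings = [set(index[t]) for t in terms if t in index]
    if len(postings) < len(terms):   # an unknown term matches no documents
        return set()
    result = postings[0]
    for p in postings[1:]:
        result &= p
    return result

print(sorted(and_query(inverted, ["climat", "chang"])))  # docs containing both terms
```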
This is my first time using NLTK as opposed to scikit-learn's tokenizer. I think NLTK has a more complicated data structure that inherits its characteristics from dictionaries and tuples, which can make it ambiguous how to call functions. By studying the documentation of those classes, I was able to complete this assignment. I certainly agree that knowledge of data structures is essential when dealing with the NLTK package, especially for people who started programming in Python. As for myself, I had quite some background in other languages such as C and Java, which really helped me navigate the documentation mentioned above.
Submit the following: (1) Your answer file; (2) Three output files for problem 3; and (3) Your source code file(s).
(1) Your answer file must:
(2) Three output files must be comma separated.
(3) Your source code file must have your name, the course name (CSC 575) and section number, and the assignment number (HW#2) at the top of the file (in the comment section). If you used Jupyter Notebook, submit the html version of the code in addition to the ipynb file.